Survival prediction of gastric cancer patients by Artificial Neural Network model.

Aim
This study aims to predict the survival rate of gastric cancer patients and to identify the factors affecting it, using an artificial neural network model.


Background
Gastric cancer is the deadliest disease in the northern and northeastern provinces of Iran. A total of 430 patients with gastric cancer who were referred to Baghban Clinic in Sari from early November 2006 to late October 2013 were followed.


Methods
A historical cohort of patients who were referred to Baghban Clinic, the cancer research center of Mazandaran University of Medical Sciences in Sari, from early November 2006 to late October 2013 was studied. Three groups of variables (demographic, biological and socio-economic) were examined. The survival rate and the factors affecting survival time were estimated using the Kaplan-Meier method and artificial neural networks, and the best network structure was chosen using the mean square error and the ROC curve. All analyses were performed using SPSS v.18.0, and the level of significance was set at α=0.05.


Results
The median survival time was 19±2.04 months. The 1- to 5-year survival rates were 0.64, 0.44, 0.34, 0.24 and 0.19, respectively. The percentage of correct predictions of the selected network and the area under the ROC curve were 92% and 94%, respectively. According to the results, type of treatment, metastasis, stage of disease, histology grade, histology type and age at diagnosis were the factors affecting survival time.


Conclusion
The 5-year survival rate of gastric cancer patients in Mazandaran is lower than in other provinces, which could be due to delays in diagnosis or in patient referral. Therefore, screening and early diagnosis could help improve the survival rate of these patients.


Introduction
Cancer, which is caused by the uncontrolled growth of cells, is regarded as the second most common cause of death in developed countries and the third in developing ones (1). Among cancers, gastric cancer, with a mortality rate of 15.5%, turned out to be the deadliest cancer in 2012 (2). This type of cancer is usually diagnosed at an advanced stage, so patient survival is low (3). Many studies have been carried out on gastric cancer, on survival analysis of the patients and on the identification of risk factors for this disease. However, most of the statistical models used in them, such as the Cox proportional hazards model and parametric models, make assumptions about the data, such as a normal distribution for the response variable, a linear relationship between the independent variables and the response variable, and homogeneity of the errors, and these assumptions do not hold in many cases. Artificial neural network models make no assumption about the distribution of the data, and they can model complex nonlinear relationships and high-order interaction effects based on internal relationships, without presupposing any form of distribution (9,10). In addition, the malfunction of some of the neurons does not cause a complete breakdown of the network, which is still likely to make correct decisions (11). Moreover, the interoperability of the model makes it possible to provide an appropriate response to a patient's situation based on the risk under the new conditions (12). This study was carried out to estimate the survival rate of patients with gastric cancer and to determine the influential factors by using an artificial neural network.

Methods
This is a historical cohort study. A total of 430 patients with gastric cancer who were referred to Baghban Clinic in Sari from early November 2006 to late October 2013 were studied. Patients whose diagnosis was made by the physicians were included, and those with less than 50% of their information available were excluded. Patients' information was kept confidential during and after the investigation and is not available to others. Survival time was considered the dependent variable. The independent variables were age, gender, body mass index (BMI), high-risk dietary habits (consuming high-calorie foods, salty and smoked foods, low amounts of fruit and vegetables, frozen meals, high amounts of salt and hot tea, each recorded as yes or no), family history (yes or no), history of chronic diseases (yes or no), history of smoking and alcohol (yes or no), occupation (including gardener, farmer, coal miner, worker exposed to toxic spills, housewife etc.), disease histology (wound, ulcer etc.), histology type (including adenocarcinoma etc.), grade of histopathological differentiation (moderate and good), tumor stage (including early stages, localized enlargement, distant metastasis), tumor size (less or more than 5 cm), tumor location (cardia, fundus, stomach, antrum, greater curvature, lesser curvature, more than one site), type of treatment (surgery with radiotherapy and chemotherapy, radiotherapy and chemotherapy, chemotherapy, no treatment) and exposure to chemicals (yes or no). All of these variables were extracted from medical records. The latest health status of the patients was obtained through phone calls and recorded in the prepared checklists, and the patients' survival time was calculated in months.
Data were analyzed using SPSS v.18.0, and the level of significance was set at α=0.05. Missing observations were imputed using the regression method. Then the Cox proportional hazards model and the nonparametric Kaplan-Meier method were used for data analysis. Survival rates were compared using the log-rank test. The final analysis was performed by fitting the artificial neural network model to the significant variables.
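The Kaplan-Meier estimate used above can be illustrated with a minimal sketch. The study itself used SPSS; the function names and the ten follow-up times below are illustrative assumptions and are not taken from the study data.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  -- follow-up time in months for each patient
    events -- 1 if death was observed, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        n_at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which the estimated survival is 0.5 or lower."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached


# Made-up follow-up data (months) for ten hypothetical patients.
times = [3, 5, 5, 8, 12, 12, 19, 24, 30, 36]
events = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
median = median_survival(curve)
```

Censored patients reduce the number at risk without lowering the curve, which is why the estimate differs from a simple proportion of survivors.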
Artificial neural networks are computational tools inspired by the human brain. They belong to the class of dynamical systems that, by processing experimental data, transfer the knowledge or rules concealed in the data to the network structure. The neuron is the smallest unit of information processing and forms the basis of neural network performance. Artificial neural networks are divided into two categories: supervised and unsupervised learning systems.
Learning systems are systems that can behave appropriately under different conditions according to the available models and can improve their performance toward a specific goal simply by observing the system's operation. Training starts from randomly selected initial weights, which are then adjusted as learning proceeds.
A neural network normally has three layers: input, intermediate (hidden) and output. The information in the input layer is transferred to the output layer layer by layer. The input to a layer can be the output of another layer or, in the first layer, raw data such as numerical data, literary texts or images.
The main task of the hidden layer is to extract classified information from the existing data, and the output layer shows the final output of the network. For this analysis, the data were first randomly divided into two parts, training and testing. What matters in neural networks is the proper choice of the weights and, if needed, the bias terms of the network. The procedure for choosing the weights is known as the learning algorithm, and it is a key feature distinguishing networks in how their parameters are set (13). To fit the neural network model, censored patients were first set aside, and the non-censored patients were divided into two groups: 211 for training and 72 for validation. Chi-square statistics were used to check the distribution of the independent variables in the two groups, and no significant difference was found between the two sets of data.
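The chi-square check on the two groups can be sketched as follows for a single binary variable. This is a minimal illustration, not the SPSS procedure used in the study: the group sizes 211 and 72 come from the text, but the yes/no counts are hypothetical, and the p-value helper only covers the 2 x 2 case (one degree of freedom).

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and variable.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Rows: training (n=211) and validation (n=72); columns: yes / no.
# The split of each row into yes/no is a hypothetical example.
table = [[76, 135], [26, 46]]
stat, df = chi_square_statistic(table)
p = p_value_df1(stat)
```

A p-value above 0.05 here would indicate that the variable is distributed similarly in the two groups, which is the condition the split is meant to satisfy.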
The observed survival of the two groups was compared using the log-rank test, and no significant difference was found between the median survival times. For the ANN model, a three-layer neural network with 17 input nodes, three hidden nodes and two output nodes was selected as the network architecture. Since the output of the network, i.e. the status of each patient, is a binary variable, the sigmoid function was applied as the activation function of the output layer. The sigmoid function was also used as the activation function of the hidden layer. The network was trained on the training set with the supervised back-propagation learning algorithm, and training was stopped when the error of the test group no longer decreased. The mean square error and the Receiver Operating Characteristic (ROC) curve were used as the criteria for determining the best network.
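A network of the kind described above (17 inputs, three sigmoid hidden nodes, two sigmoid output nodes, back-propagation on the squared error, stopping when the held-out error no longer decreases) can be sketched in plain Python. This is an illustrative sketch only, not the SPSS model fitted in the study: the synthetic patients, the learning rate, the patience value, the random seeds and all names are assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ThreeLayerNet:
    """Fully connected 17-3-2 network with sigmoid hidden and output nodes."""

    def __init__(self, n_in=17, n_hidden=3, n_out=2, seed=0):
        rng = random.Random(seed)
        # One row of weights per node; the last entry of each row is the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(n_out)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w1]
        output = [sigmoid(sum(w * v for w, v in zip(row, hidden + [1.0])))
                  for row in self.w2]
        return hidden, output

    def train_step(self, x, target, lr):
        """One back-propagation update on the squared error of one patient."""
        hidden, output = self.forward(x)
        # Error gradient (up to a constant factor) times the sigmoid derivative.
        d_out = [(o - t) * o * (1.0 - o) for o, t in zip(output, target)]
        # Hidden deltas are computed before the output weights are changed.
        d_hid = [h * (1.0 - h) * sum(d * row[j] for d, row in zip(d_out, self.w2))
                 for j, h in enumerate(hidden)]
        for d, row in zip(d_out, self.w2):
            for j, v in enumerate(hidden + [1.0]):
                row[j] -= lr * d * v
        for d, row in zip(d_hid, self.w1):
            for i, v in enumerate(x + [1.0]):
                row[i] -= lr * d * v

    def mse(self, data):
        total = 0.0
        for x, target in data:
            _, output = self.forward(x)
            total += sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)
        return total / len(data)


def fit(net, train, valid, lr=0.5, max_epochs=50, patience=10):
    """Train until the held-out error stops decreasing; return the best error."""
    best = net.mse(valid)
    best_weights = ([r[:] for r in net.w1], [r[:] for r in net.w2])
    stale = 0
    for _ in range(max_epochs):
        for x, target in train:
            net.train_step(x, target, lr)
        err = net.mse(valid)
        if err < best:
            best, stale = err, 0
            best_weights = ([r[:] for r in net.w1], [r[:] for r in net.w2])
        else:
            stale += 1
            if stale >= patience:
                break
    # Restore the weights that gave the lowest held-out error.
    net.w1, net.w2 = best_weights
    return best


def make_patients(n, rng):
    """Synthetic patients: 17 binary inputs, status decided by the first input."""
    rows = []
    for _ in range(n):
        x = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(17)]
        target = [1.0, 0.0] if x[0] == 1.0 else [0.0, 1.0]
        rows.append((x, target))
    return rows


rng = random.Random(1)
train, valid = make_patients(211, rng), make_patients(72, rng)
net = ThreeLayerNet()
initial_error = net.mse(valid)
final_error = fit(net, train, valid)
```

The two output nodes encode the binary status as a one-hot target, one node per class, which is one common way to obtain two outputs for a binary variable.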

Results
Among the 430 patients with gastric cancer, 296 (68.6%) were male and 134 (31.4%) were female, giving a male-to-female ratio of 2.2. The mean age of the patients was 64.45 ±13.56 years (65.98 ±12.22 for men and 61.12 ±15.66 for women). Nine percent of the patients were under the age of 45 and 56% were over 65 years.
Based on the occupational information available for 258 patients, 100 men were farmers and stockmen and 95 women were housewives. Of the patients, 52.1% lived in urban areas and 47.9% in rural areas; 36% were smokers and 7% had a history of alcohol consumption. Regarding dietary habits, 24% of the patients had a high salt intake, 59.1% had a high consumption of hot tea, 12.9% had frozen meals in their diet, 14.3% consumed high-calorie foods, 5% had salty and smoked food in their diet, and 33.7% consumed very small amounts of fruit and vegetables. The tumor was located in the cardia in 64 patients (24.8%) and in the gastric antrum in 62 patients (24%). According to the information on tumor type available for 179 patients, in 79.9% of cases the tumor appeared as a scar and ulcer. Among the 75 patients whose tumor size was recorded in the pathology sheet, 38 had tumors larger than 5 cm. Of the 286 patients whose disease progression was mentioned in their records, 214 (74.8%) were diagnosed at an advanced stage of disease (stages 3 and 4). Patients were also studied in terms of their symptoms at and before diagnosis, and the results are presented in Table 1. Survival curves by stage of disease show that survival declined more steeply in the advanced stages of disease than in the early stages (Figure 1).
Patients with older age, weaker histology grade, metastases at diagnosis, advanced stages of cancer, rural residence or adenocarcinoma histology type, and those whose first treatment was a non-surgical procedure, had shorter survival. There was no significant relationship between survival time and factors such as gender, body mass index, occupation, history of smoking, history of alcohol use, history of cancer in relatives, history of chronic gastrointestinal disease, marital status, consumption of high-calorie, salted and smoked meals, low intake of fruits and vegetables, tumor location, tumor type, tumor size and type of organ metastasis. Similarly, no statistically significant relationship was found between survival time and the consumption of too much salt (p =0.077), hot tea (p =0.178) or frozen meals (p =0.071), but these factors were nonetheless found to be closely related to it.
Figure 1. Flowchart of classification of gastric cancer patients
Before the training and test sets were used, chi-square statistics were used to check that there was no significant difference in the distribution of the independent variables between the two groups, and, given the resulting probability (p =0.929), no significant difference was observed. The results of this comparison are shown in Table 2. Survival in the two groups was compared using the log-rank test, and the difference between the median survival times of the groups was not significant.
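The log-rank comparison of two groups can be sketched as below. It is a minimal two-group version under stated assumptions, not the SPSS routine used in the study; the function name and the example follow-up times are hypothetical.

```python
import math


def logrank_test(times1, events1, times2, events2):
    """Two-group log-rank statistic and its p-value (1 degree of freedom).

    times*  -- follow-up times; events* -- 1 for death, 0 for censoring
    """
    all_times = list(times1) + list(times2)
    all_events = list(events1) + list(events2)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})
    observed1 = expected1 = variance = 0.0
    for t in event_times:
        # Patients still under observation just before time t.
        n1 = sum(1 for x in times1 if x >= t)
        n2 = sum(1 for x in times2 if x >= t)
        d1 = sum(1 for x, e in zip(times1, events1) if x == t and e == 1)
        d2 = sum(1 for x, e in zip(times2, events2) if x == t and e == 1)
        n, d = n1 + n2, d1 + d2
        observed1 += d1
        # Deaths expected in group 1 if both groups share one survival curve.
        expected1 += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    stat = (observed1 - expected1) ** 2 / variance
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p


# Hypothetical follow-up times (months); every death is observed.
early = [1, 2, 3, 4, 5]
late = [10, 11, 12, 13, 14]
stat, p = logrank_test(early, [1] * 5, late, [1] * 5)
```

A non-significant result, as reported for the training and validation groups, means the observed deaths in each group stay close to the numbers expected under a common survival curve.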
The Kaplan-Meier curve was drawn for the observed survival of all patients in the two groups, and it confirms this finding (Figure 2). To fit the ANN model, a three-layer neural network with 17 input nodes, three hidden nodes and two output nodes was chosen as the network architecture. Since the output of the network, i.e. the status of each patient, is a binary variable, the sigmoid function was used as the activation function of the output layer.
In terms of the importance of the independent variables, the treatment variable, with a standardized importance of 100%, followed by the stage of disease progression with 93.3%, were the most important variables, and age, with 16%, was the least important (Table 3). The percentage of correct predictions of the network and the area under the ROC curve were 92% and 94%, respectively.
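The two performance measures reported above, the percentage of correct predictions and the area under the ROC curve, can be computed as in the following sketch. The labels and scores are made up for illustration, and the 0.5 classification threshold is an assumption.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Share of patients whose predicted status matches the observed status."""
    predicted = [1 if s >= threshold else 0 for s in y_score]
    return sum(p == y for p, y in zip(predicted, y_true)) / len(y_true)


def roc_auc(y_true, y_score):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case,
    with ties counted as one half.
    """
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up observed statuses and network scores for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
acc = accuracy(y_true, y_score)
auc = roc_auc(y_true, y_score)
```

Accuracy depends on the chosen threshold, whereas the area under the ROC curve summarizes performance over all thresholds, which is why the two figures can differ.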

Discussion
In the present study, 296 (68.6%) of the patients were male and 134 (31.4%) were female, so the male-to-female ratio was 2.2, which corresponds to similar studies conducted in Ardabil, Fars and Tehran (14,15,16). The highest prevalence was found in the seventh decade of life, and the results of other studies confirm this finding (17,18). The mean age of patients was 64.45 years (65.98 for men and 61.12 for women), which is higher than the mean age reported in other studies (15,19-21). The results of the log-rank test showed a significant difference between the longevity of patients (14,15,29). As expected, histology grade was an influential factor in the survival of gastric cancer patients in this study. This means that patients diagnosed with a well-differentiated grade have a lower risk of death, which is confirmed by studies in Japan and Spain (24,25). A study by Pourhoseingholi and colleagues in Iran also shows that patients with a lower degree of differentiation face a higher risk of death (21). The presence or absence of metastasis was significantly associated with survival time, which is confirmed by studies in other countries and is consistent with the results of the studies by Moghimi Dehkordi and Biglarian (15,29). In this research, treatment was recognized as a factor affecting patient survival: patients who received surgery with chemotherapy and radiotherapy had a higher survival rate than patients who had surgery with chemotherapy, and patients who were treated with surgery and chemotherapy had a higher survival rate than patients who were treated with surgery alone. Studies in North America, China and Europe have shown the complementary effect of chemotherapy and radiotherapy on patients' survival (26,28).
In 2007, the American Cancer Research Association announced that the consumption of certain meals could increase the risk of incidence and development of the disease (30). A study of an immigrant population has also emphasized the role of dietary factors as one of the most important causes of gastric cancer. Some epidemiological, case-control and cohort studies suggest that the risk of this cancer increases with the consumption of highly salted meals and salted and processed meat, and decreases with a high consumption of fruits and vegetables (31,32). In this study, 24% of patients had a history of high salt intake, 59.1% had a high consumption of hot tea, 12.9% used to eat frozen meals, 14.3% were accustomed to eating high-calorie foods, 5% ate salty and smoked foods, and 33.7% consumed very small amounts of fruits and vegetables. In this research, age at cancer diagnosis was also found to be an influential factor in survival: patients whose disease was diagnosed at a younger age had higher survival than others. This may be due to less advanced disease or better physical condition at younger ages. These results are consistent with the studies by Yazdani, Pourhoseingholi and Moghimi Dehkordi and with studies conducted in other countries, but they contradict a study conducted in Mazandaran province (15,22,25,33). The limitations of this study were missing information in patients' records, incomplete pathology reports, the failure to register certain important information such as disease progression, histology grade and tumor size in patients' files, and the lack of access to patients or their families due to changes in contact information. The use of a precise statistical method for predicting patient survival and identifying related factors can be considered a strength of this study.
In this study, an artificial neural network model was used to predict the survival of patients with gastric cancer. The results showed that treatment type, with a standardized importance of 100%, and disease progression stage, with 93.3%, were the most important independent variables, and age, with 16%, was the least important. The distant metastasis and disease progression variables were removed from the final output because of their lower importance in the network. Finally, based on the study results, we found that the 5-year survival rate of patients with gastric cancer in Sari is low, which could be due to delays in diagnosis and referral. Therefore, screening and early diagnosis could help improve the survival of these patients.